I. Research Objectives

The current landscape of police brutality in Chicago has community leaders, policy makers, and the public asking what can be done to prevent more deaths. Data plays a central role in shedding light on excessive use of force and racially motivated patterns of violence.

Our foundational research objective is to analyze data on police misconduct and accountability in Chicago. We endeavor to understand the nature and extent of misconduct and its correlations with city demographics. Using the datasets provided by the Invisible Institute, we hope to provide insights into how these complaints affect Chicagoans geographically and demographically.

II. Methodology

The data for our project was obtained from the Civilian Office of Police Accountability (COPA) via a Freedom of Information Act (FOIA) request made by the Invisible Institute. The release comprised seven datasets.

Data Preparation

Loaded Required Libraries

The first step was to load both the data and the required libraries into R.

# install.packages("tidyverse")  # run once if the packages are not yet installed
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.1     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(skimr)
library(RMySQL)
## Loading required package: DBI
library(keyring)
library(odbc)
library(readxl)
library(RSQLite)
## 
## Attaching package: 'RSQLite'
## The following object is masked from 'package:RMySQL':
## 
##     isIdCurrent
library(dbplyr)
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
library(dplyr)
library(ggplot2)
library(esquisse)
library(modeldata)

Loaded the Datasets

df_accused <- read_excel("accused.xlsx") 
df_case_info <- read_excel("case_info.xlsx") 
df_civilian_witness <- read_excel("civilian_witness.xlsx") 
df_complainant <- read_excel("complainant.xlsx") 
df_cpd_witness <- read_excel("cpd_witness.xlsx") 
df_investigators <- read_excel("investigators.xlsx") 
df_victim <- read_excel("victim.xlsx") 
Removed duplicates

The distinct() call removes fully duplicated rows from df_accused, producing distinct_df_accused; counting by the primary key (LOG_NO) then shows how often each log number still appears.

distinct_df_accused <- distinct(df_accused)
count(distinct_df_accused, LOG_NO) %>% arrange(desc(n))
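The distinction between removing duplicate rows and counting a key can be illustrated on a toy tibble (the rows below are invented for demonstration, not drawn from the COPA data):

```r
library(dplyr)

# Toy tibble with one fully duplicated row (invented data)
toy <- tibble::tibble(
  LOG_NO  = c(1001, 1001, 1001, 1002),
  OFFICER = c("A",  "A",  "B",  "C")
)

toy_distinct <- distinct(toy)   # drops exact-duplicate rows only
nrow(toy_distinct)              # 3 rows remain

# A LOG_NO can still repeat when other columns differ,
# which is why count() by LOG_NO is a separate check
count(toy_distinct, LOG_NO) %>% arrange(desc(n))
```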
Trimmed White Spaces

# str_trim() strips leading/trailing whitespace from character columns;
# the result is assigned back so the cleaning persists
distinct_df_accused <- distinct_df_accused %>% 
  mutate(across(where(is.character), str_trim))
Identified Columns with 50%+ missing values
columns_missing_most_data <- distinct_df_accused %>%
  summarise(across(everything(), ~ skimr::n_missing(.x))) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "missing_count") %>% 
  mutate(proportion_missing = round(missing_count / nrow(df_accused), 2)) %>% 
  arrange(desc(missing_count)) %>% 
  # Detect columns that were missing more than 50% of the data
  filter(proportion_missing > .5) %>% 
  pull(variable)

message("Columns missing 50%+ data:")
## Columns missing 50%+ data:
columns_missing_most_data
## [1] "NO_OF_DAYS"           "PENALTY_CODE"         "EMPLOYEE_DETAIL_UNIT"
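Once the mostly empty columns are identified, they can be dropped in a single select() step. The sketch below uses a toy stand-in for the accused data (invented values, real column name); whether to drop is a judgment call — PENALTY_CODE, for instance, is analyzed later despite being mostly missing.

```r
library(dplyr)

# Toy stand-in for distinct_df_accused (invented values)
toy <- tibble::tibble(
  LOG_NO     = 1:4,
  NO_OF_DAYS = c(NA, NA, NA, 5)   # 75% missing
)

cols_to_drop <- c("NO_OF_DAYS")

# all_of() errors loudly if a named column is absent, a safe default
toy_trimmed <- toy %>% select(-all_of(cols_to_drop))
names(toy_trimmed)   # "LOG_NO"
```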

Normalization

First Normal Form

The CPD datasets were in First Normal Form, with all entries being single-valued and atomic.
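A quick way to spot-check the atomicity claim is to scan character columns for common list delimiters. This is a minimal sketch on a toy frame (invented rows; a real multi-valued cell would need manual review):

```r
library(dplyr)
library(stringr)

# Toy frame where one cell violates atomicity (invented data)
toy <- tibble::tibble(
  LOG_NO    = c(1001, 1002),
  RACE_CODE = c("WHI", "BLK; WHI")   # second cell holds two values
)

# Count cells in character columns that contain common list delimiters
violations <- toy %>%
  summarise(across(where(is.character),
                   ~ sum(str_detect(.x, "[;,|]"), na.rm = TRUE)))
violations$RACE_CODE   # 1 suspect cell
```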

Second Normal Form

Our first step was to identify the primary keys in the dataset. The LOG_NO was the central key that tied all the datasets together.

The distinct function showed that there were 84,638 distinct LOG_NO values in the dataset.

distinct(df_accused, LOG_NO)

A second candidate key surfaced in the data: the PENALTY_ID column, with a total of 114,344 distinct values.

count(distinct_df_accused, PENALTY_ID) %>% arrange(desc(n))

The third candidate key in the data was STAR_NO, which had 14,395 distinct entries.

count(distinct_df_accused, STAR_NO) %>% arrange(desc(n))

Third Normal Form

For 3NF, we grouped the data to ensure that non-key columns depended directly on the primary key rather than transitively through another column. This included identifying: race code, sex code, penalty code, allegation code, investigation code, finding code, and location code.
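Removing a transitive dependency typically means pulling each code/description pair into its own lookup table. The sketch below does this for the allegation columns counted later in this section, using invented rows and illustrative code values:

```r
library(dplyr)

# Toy slice of the accused data (invented rows; illustrative code values)
toy_accused <- tibble::tibble(
  LOG_NO                 = c(1001, 1002, 1003),
  ALLEGATION_CATEGORY_CD = c("05A", "05A", "10U"),
  ALLEGATION_CATEGORY    = c("Use Of Force", "Use Of Force", "Verbal Abuse")
)

# Lookup table: the description depends only on the code, not on LOG_NO
allegation_lookup <- toy_accused %>%
  distinct(ALLEGATION_CATEGORY_CD, ALLEGATION_CATEGORY)

# The fact table keeps only the code, removing the transitive dependency
toy_fact <- toy_accused %>%
  select(LOG_NO, ALLEGATION_CATEGORY_CD)

nrow(allegation_lookup)   # 2 distinct code/description pairs
```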

Race Code

count(distinct_df_accused, RACE_CODE_CD) %>% arrange(desc(n))

Sex Code

count(distinct_df_accused, SEX_CODE_CD) %>% arrange(desc(n))

Penalty Code

count(distinct_df_accused, PENALTY_CODE) %>% arrange(desc(n))

Allegation Code

count(distinct_df_accused, ALLEGATION_CATEGORY, ALLEGATION_CATEGORY_CD) %>% arrange(desc(n))

EER Model

We analyzed a total of seven datasets to build the EER model in SQL. The tables were partitioned into aggrieved party, accused/CPD, allegation, and outcome.
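The partitioned design can be exercised with the DBI/RSQLite packages loaded earlier. The table and column choices below are illustrative stand-ins, not the final EER schema; the point is that LOG_NO joins the partitioned tables back together:

```r
library(DBI)
library(RSQLite)

# In-memory SQLite database standing in for the project schema
# (tables and columns are illustrative, not the final EER design)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

accused   <- data.frame(LOG_NO = c(1001, 1002), STAR_NO = c(111, 222))
case_info <- data.frame(LOG_NO = c(1001, 1002), BEAT = c("0511", "1020"))

dbWriteTable(con, "accused", accused)
dbWriteTable(con, "case_info", case_info)

# LOG_NO rejoins the partitioned tables
joined <- dbGetQuery(con, "
  SELECT a.LOG_NO, a.STAR_NO, c.BEAT
  FROM accused a
  JOIN case_info c ON a.LOG_NO = c.LOG_NO
")
dbDisconnect(con)
nrow(joined)   # 2
```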

III. Data Analysis

Allegation Data

count(distinct_df_accused, ALLEGATION_CATEGORY) %>% arrange(desc(n))

Penalty Data

select(distinct_df_accused, PENALTY_CODE)

There are a total of 225,523 cases within the accused dataset. Filtering the data shows that Chicago police officers were penalized in 49,221 of those cases — only about 22% of the time.

distinct_df_accused %>% 
  filter(! is.na(PENALTY_CODE))
49221/225523 * 100
## [1] 21.82527
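The same proportion can be computed directly from the column rather than hard-coding the counts. A minimal sketch on a toy vector (invented values):

```r
# Toy stand-in: 1 of 4 rows has a recorded penalty (invented data)
toy_penalty <- c("SUSPENSION", NA, NA, NA)

# Share of rows with a recorded penalty, computed directly
penalized_share <- mean(!is.na(toy_penalty)) * 100
penalized_share   # 25 (%) on the toy vector; ~21.8% on the full dataset
```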

Roughly 176,000 accused officers had no penalty listed. Of the penalties that were recorded, 32,247 were suspensions, 8,707 resulted in reprimands, 4,132 were categorized as ‘violation noted’, and 4,131 as ‘separation’.

count(distinct_df_accused, PENALTY_CODE) %>% arrange(desc(n))
distinct_df_accused %>%
 filter(!(PENALTY_CODE %in% "WORK REGULAR DAY OFF") | is.na(PENALTY_CODE)) %>%
 ggplot() +
  aes(x = PENALTY_CODE, fill = PENALTY_CODE) +
  geom_bar() +
  scale_fill_hue(direction = -1) +
  labs(
    x = "Penalty Type",
    y = "Total Penalties",
    title = "Penalty Report"
  ) +
  theme_bw() +
  theme(legend.position = "bottom")

Findings Data

count(distinct_df_accused, ACCUSED_ARRESTED) %>% arrange(desc(n))
count(distinct_df_accused, FINDING_CODE) %>% arrange(desc(n))
distinct_df_accused <- distinct_df_accused %>% 
  mutate(FINDING_CODE = ifelse(
    FINDING_CODE == "ADDITIONAL INVESTIGATION REQUESTED", "ADDITIONALINFOREQ", FINDING_CODE))
ggplot(distinct_df_accused) +
  aes(x = FINDING_CODE, fill = FINDING_CODE) +
  geom_bar() +
  scale_fill_hue(direction = 1) +
  labs(x = "Case Finding", y = "Total Number", title = "Finding Report") +
  coord_flip() +
  theme_minimal()